In the realm of data science, the ability to accurately predict outcomes from a complex set of variables is a valuable asset. This project, which forms the capstone of our Machine Learning course, is designed to showcase the skills and knowledge we have accrued in predictive modeling over the semester. Our task was to select and work with a dataset comprising at least 30 features and 10,000 observations—a criterion that ensures the handling of a dataset with considerable complexity and volume, thus providing a realistic challenge akin to real-world data science tasks.
For this project, we have chosen a rich dataset from the UCI Machine Learning Repository: the "Real-Time Election Results: Portugal 2019" dataset (https://archive.ics.uci.edu/dataset/513/real+time+election+results+portugal+2019). It offers an opportunity to explore electoral data and the dynamics and patterns that emerged during the 2019 Portuguese legislative election.
With 21,643 observations, the dataset offers a granular view of the voting process, capturing real-time vote counts across territories and parties. Its 28 features include the number of candidates, territory information, party details, and the number of mandates, each offering a glimpse into the multifaceted nature of electoral systems and voter preferences. This richness supports both comprehensive exploratory data analysis and the application of machine learning techniques to predict outcomes such as final mandates. Through this project, we aim to uncover insights and patterns within the electoral data, contributing to a deeper understanding of the factors that influence election results.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
# Loading the dataset
data = pd.read_csv('portugal.csv')
# Checking for missing or null values in the dataset
missing_values = data.isnull().sum()
# Displaying the count of missing values in each column
print("Missing values in each column:\n", missing_values)
# There were no null values; if any existed, we could remove the affected rows:
# cleaned_data = data.dropna()
Missing values in each column:
TimeElapsed                 0
time                        0
territoryName               0
totalMandates               0
availableMandates           0
numParishes                 0
numParishesApproved         0
blankVotes                  0
blankVotesPercentage        0
nullVotes                   0
nullVotesPercentage         0
votersPercentage            0
subscribedVoters            0
totalVoters                 0
pre.blankVotes              0
pre.blankVotesPercentage    0
pre.nullVotes               0
pre.nullVotesPercentage     0
pre.votersPercentage        0
pre.subscribedVoters        0
pre.totalVoters             0
Party                       0
Mandates                    0
Percentage                  0
validVotesPercentage        0
Votes                       0
Hondt                       0
FinalMandates               0
dtype: int64
The output of the data-cleaning check shows that the dataset has no missing values in any of its 28 columns, including 'TimeElapsed', 'time', 'territoryName', 'totalMandates', and the various voting statistics such as 'blankVotes', 'nullVotes', and 'votersPercentage'. This indicates a well-maintained and comprehensive dataset, and the absence of missing values simplifies preprocessing, letting us move directly to analysis and modeling. The columns record the number of available mandates, the number of parishes, the percentage of blank and null votes, and each party's ('Party') performance: mandates won ('Mandates'), vote percentage ('Percentage'), valid-vote percentage, total votes received ('Votes'), and seats allocated under the D'Hondt method ('Hondt') alongside the final mandates. This richness makes the dataset well suited to analyzing voting patterns, party popularity, and the overall electoral process in Portugal.
data.describe()
|  | TimeElapsed | totalMandates | availableMandates | numParishes | numParishesApproved | blankVotes | blankVotesPercentage | nullVotes | nullVotesPercentage | votersPercentage | ... | pre.nullVotesPercentage | pre.votersPercentage | pre.subscribedVoters | pre.totalVoters | Mandates | Percentage | validVotesPercentage | Votes | Hondt | FinalMandates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | 21643.000000 | ... | 21643.000000 | 21643.000000 | 2.164300e+04 | 2.164300e+04 | 21643.000000 | 21643.000000 | 21643.000000 | 2.164300e+04 | 21643.000000 | 21643.000000 |
| mean | 133.146052 | 11.544795 | 11.499284 | 309.956013 | 261.090237 | 8875.066673 | 2.621629 | 6148.068752 | 1.961471 | 51.983722 | ... | 1.777442 | 54.549372 | 6.378503e+05 | 3.594932e+05 | 0.565495 | 4.968484 | 5.207232 | 1.585209e+04 | 1.126138 | 1.126138 |
| std | 77.651193 | 31.314567 | 31.738783 | 659.055911 | 583.377428 | 21484.874088 | 0.795289 | 14735.469269 | 0.375250 | 4.854380 | ... | 0.388798 | 4.964948 | 1.544418e+06 | 8.763729e+05 | 4.421804 | 10.379967 | 10.881108 | 9.310605e+04 | 6.293552 | 6.872644 |
| min | 0.000000 | 0.000000 | 0.000000 | 54.000000 | 3.000000 | 19.000000 | 0.530000 | 39.000000 | 1.340000 | 35.980000 | ... | 1.140000 | 40.870000 | 6.383000e+03 | 3.215000e+03 | 0.000000 | 0.020000 | 0.020000 | 1.000000e+00 | 0.000000 | 0.000000 |
| 25% | 65.000000 | 1.000000 | 0.000000 | 75.000000 | 67.000000 | 1188.000000 | 2.230000 | 1094.000000 | 1.720000 | 50.290000 | ... | 1.520000 | 51.210000 | 1.289260e+05 | 6.964400e+04 | 0.000000 | 0.220000 | 0.230000 | 2.360000e+02 | 0.000000 | 0.000000 |
| 50% | 135.000000 | 4.000000 | 3.000000 | 147.000000 | 120.000000 | 2998.000000 | 2.640000 | 2232.000000 | 1.870000 | 53.130000 | ... | 1.690000 | 56.260000 | 2.284970e+05 | 1.102730e+05 | 0.000000 | 0.620000 | 0.650000 | 7.900000e+02 | 0.000000 | 0.000000 |
| 75% | 200.000000 | 9.000000 | 9.000000 | 242.000000 | 208.000000 | 6889.000000 | 2.980000 | 4121.000000 | 2.230000 | 54.550000 | ... | 1.970000 | 58.220000 | 3.933140e+05 | 2.276200e+05 | 0.000000 | 3.010000 | 3.160000 | 4.510000e+03 | 0.000000 | 0.000000 |
| max | 265.000000 | 226.000000 | 226.000000 | 3092.000000 | 3092.000000 | 129599.000000 | 5.460000 | 88539.000000 | 3.350000 | 59.870000 | ... | 3.120000 | 62.580000 | 9.439701e+06 | 5.380451e+06 | 106.000000 | 49.110000 | 51.420000 | 1.866407e+06 | 94.000000 | 106.000000 |
8 rows × 25 columns
# Some columns are hidden in the table above; display the middle columns explicitly.
data.describe().iloc[:,9:-10]
|  | votersPercentage | subscribedVoters | totalVoters | pre.blankVotes | pre.blankVotesPercentage | pre.nullVotes |
|---|---|---|---|---|---|---|
| count | 21643.000000 | 2.164300e+04 | 2.164300e+04 | 21643.000000 | 21643.000000 | 21643.000000 |
| mean | 51.983722 | 6.275367e+05 | 3.390741e+05 | 7608.001386 | 2.071985 | 5914.629950 |
| std | 4.854380 | 1.525590e+06 | 8.290404e+05 | 18493.107257 | 0.518025 | 14236.038023 |
| min | 35.980000 | 5.767000e+03 | 2.833000e+03 | 32.000000 | 0.800000 | 40.000000 |
| 25% | 50.290000 | 1.229870e+05 | 6.267100e+04 | 1130.000000 | 1.740000 | 1124.000000 |
| 50% | 53.130000 | 2.289540e+05 | 1.060120e+05 | 2595.000000 | 2.030000 | 2141.000000 |
| 75% | 54.550000 | 3.804890e+05 | 2.069180e+05 | 5929.000000 | 2.370000 | 3967.000000 |
| max | 59.870000 | 9.343084e+06 | 5.092424e+06 | 112666.000000 | 3.660000 | 86473.000000 |
data.shape
(21643, 28)
The dataset contains 21,643 rows and 28 columns.
# Identifying categorical and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
# Encoding categorical variables
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le  # Storing the label encoder for each column
# Ensuring numerical columns are in correct format
for col in numerical_cols:
    data[col] = pd.to_numeric(data[col], errors='coerce')
# Displaying the first few rows of the processed dataset
print(data.head())
# The dataset is now ready for machine learning models.
   TimeElapsed  time  territoryName  totalMandates  availableMandates  \
0            0     0             16              0                226
1            0     0             16              0                226
2            0     0             16              0                226
3            0     0             16              0                226
4            0     0             16              0                226

   numParishes  numParishesApproved  blankVotes  blankVotesPercentage  \
0         3092                 1081        9652                   2.5
1         3092                 1081        9652                   2.5
2         3092                 1081        9652                   2.5
3         3092                 1081        9652                   2.5
4         3092                 1081        9652                   2.5

   nullVotes  ...  pre.votersPercentage  pre.subscribedVoters  \
0       8874  ...                 52.66                813743
1       8874  ...                 52.66                813743
2       8874  ...                 52.66                813743
3       8874  ...                 52.66                813743
4       8874  ...                 52.66                813743

   pre.totalVoters  Party  Mandates  Percentage  validVotesPercentage   Votes  \
0           428546     17         0       38.29                 40.22  147993
1           428546     15         0       33.28                 34.95  128624
2           428546      1         0        6.81                  7.15   26307
3           428546      2         0        4.90                  5.14   18923
4           428546     11         0        4.59                  4.83   17757

   Hondt  FinalMandates
0     94            106
1     81             77
2     16             19
3     12              5
4     11             12

[5 rows x 28 columns]
The output displays the first five rows of a dataset with 28 columns, encompassing various electoral attributes such as 'TimeElapsed', 'territoryName', voting statistics ('blankVotes', 'nullVotes', 'Percentage'), and results ('Mandates', 'Hondt', 'FinalMandates'). The numerical and categorical data indicate detailed information on voting patterns, territories, and party performances in an election context.
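Because label encoding replaces category names with integer codes, the stored encoders are what let us map codes back to names when interpreting results. A minimal, self-contained sketch of the round trip (the territory names here are hypothetical examples, not values from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical territory names standing in for a real categorical column
territory = pd.Series(['Aveiro', 'Braga', 'Aveiro', 'Porto'])

le = LabelEncoder()
encoded = le.fit_transform(territory)    # codes assigned in sorted label order
decoded = le.inverse_transform(encoded)  # back to the original strings
print(list(encoded), list(decoded))
```

This is why each fitted encoder is kept in `label_encoders`: without it, the integer codes in the processed frame would be irreversible.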
# Calculating the correlation matrix
correlation_matrix = data.corr()
# Displaying the correlation matrix as a heatmap
plt.figure(figsize=(20, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
# Displaying the correlation table
print("Correlation Table:")
print(correlation_matrix)
Correlation Table:
TimeElapsed time territoryName totalMandates \
TimeElapsed 1.000000 1.000000 -0.012674 0.260377
time 1.000000 1.000000 -0.012674 0.260377
territoryName -0.012674 -0.012674 1.000000 0.172374
totalMandates 0.260377 0.260377 0.172374 1.000000
availableMandates -0.261627 -0.261627 0.181563 0.217404
numParishes -0.001944 -0.001944 0.223300 0.765877
numParishesApproved 0.100432 0.100432 0.215599 0.873415
blankVotes 0.192350 0.192350 0.178082 0.969988
blankVotesPercentage 0.034343 0.034343 -0.302412 -0.003149
nullVotes 0.185784 0.185784 0.196554 0.967294
nullVotesPercentage -0.225021 -0.225021 -0.126432 -0.165132
votersPercentage 0.029698 0.029698 0.104222 0.171317
subscribedVoters 0.195487 0.195487 0.195483 0.974407
totalVoters 0.198243 0.198243 0.193847 0.976336
pre.blankVotes 0.195734 0.195734 0.182412 0.973357
pre.blankVotesPercentage 0.093739 0.093739 -0.288817 0.034774
pre.nullVotes 0.189902 0.189902 0.200393 0.969992
pre.nullVotesPercentage -0.106054 -0.106054 -0.083214 -0.132143
pre.votersPercentage 0.054761 0.054761 0.078855 0.151927
pre.subscribedVoters 0.193160 0.193160 0.196979 0.972976
pre.totalVoters 0.196667 0.196667 0.195102 0.975404
Party -0.001169 -0.001169 0.012263 -0.003111
Mandates 0.090190 0.090190 0.057823 0.336989
Percentage 0.000767 0.000767 -0.003433 -0.010171
validVotesPercentage 0.000526 0.000526 -0.004874 -0.010474
Votes 0.082417 0.082417 0.078520 0.395617
Hondt -0.001100 -0.001100 0.084131 0.288420
FinalMandates -0.001007 -0.001007 0.077042 0.264117
availableMandates numParishes numParishesApproved \
TimeElapsed -0.261627 -0.001944 0.100432
time -0.261627 -0.001944 0.100432
territoryName 0.181563 0.223300 0.215599
totalMandates 0.217404 0.765877 0.873415
availableMandates 1.000000 0.745472 0.562719
numParishes 0.745472 1.000000 0.954092
numParishesApproved 0.562719 0.954092 1.000000
blankVotes 0.390638 0.861279 0.951265
blankVotesPercentage -0.002806 0.032013 0.035153
nullVotes 0.407366 0.870240 0.956142
nullVotesPercentage 0.029915 0.013209 -0.037416
votersPercentage 0.112293 0.073855 0.081157
subscribedVoters 0.385828 0.856855 0.946191
totalVoters 0.379093 0.850881 0.940941
pre.blankVotes 0.383877 0.856779 0.947190
pre.blankVotesPercentage -0.008966 0.036461 0.050378
pre.nullVotes 0.399074 0.864837 0.951899
pre.nullVotesPercentage -0.044174 -0.058017 -0.073301
pre.votersPercentage 0.080524 0.026407 0.042180
pre.subscribedVoters 0.391272 0.860782 0.949110
pre.totalVoters 0.383089 0.853779 0.943147
Party -0.001882 -0.004013 -0.003765
Mandates 0.072320 0.257365 0.293795
Percentage -0.011083 -0.010770 -0.009840
validVotesPercentage -0.011052 -0.010624 -0.009767
Votes 0.152348 0.343593 0.380311
Hondt 0.291310 0.359283 0.340901
FinalMandates 0.266764 0.329009 0.312177
blankVotes blankVotesPercentage nullVotes ... \
TimeElapsed 0.192350 0.034343 0.185784 ...
time 0.192350 0.034343 0.185784 ...
territoryName 0.178082 -0.302412 0.196554 ...
totalMandates 0.969988 -0.003149 0.967294 ...
availableMandates 0.390638 -0.002806 0.407366 ...
numParishes 0.861279 0.032013 0.870240 ...
numParishesApproved 0.951265 0.035153 0.956142 ...
blankVotes 1.000000 0.038503 0.998781 ...
blankVotesPercentage 0.038503 1.000000 0.006615 ...
nullVotes 0.998781 0.006615 1.000000 ...
nullVotesPercentage -0.150894 0.094047 -0.136296 ...
votersPercentage 0.183970 -0.190778 0.181411 ...
subscribedVoters 0.998297 0.001598 0.998988 ...
totalVoters 0.997951 -0.002062 0.998405 ...
pre.blankVotes 0.999657 0.030061 0.998883 ...
pre.blankVotesPercentage 0.072183 0.959142 0.041317 ...
pre.nullVotes 0.997490 -0.006067 0.999298 ...
pre.nullVotesPercentage -0.150275 -0.307544 -0.125639 ...
pre.votersPercentage 0.161642 -0.105314 0.154965 ...
pre.subscribedVoters 0.998487 0.002821 0.999296 ...
pre.totalVoters 0.998178 -0.000069 0.998706 ...
Party -0.003826 -0.019692 -0.003172 ...
Mandates 0.326630 -0.000926 0.325711 ...
Percentage -0.011883 -0.006258 -0.011602 ...
validVotesPercentage -0.012002 -0.002076 -0.011820 ...
Votes 0.404003 -0.001486 0.404195 ...
Hondt 0.322780 -0.001305 0.326144 ...
FinalMandates 0.295583 -0.001195 0.298663 ...
pre.votersPercentage pre.subscribedVoters \
TimeElapsed 0.054761 0.193160
time 0.054761 0.193160
territoryName 0.078855 0.196979
totalMandates 0.151927 0.972976
availableMandates 0.080524 0.391272
numParishes 0.026407 0.860782
numParishesApproved 0.042180 0.949110
blankVotes 0.161642 0.998487
blankVotesPercentage -0.105314 0.002821
nullVotes 0.154965 0.999296
nullVotesPercentage -0.233688 -0.153891
votersPercentage 0.909825 0.172542
subscribedVoters 0.152401 0.999954
totalVoters 0.167300 0.999515
pre.blankVotes 0.160625 0.999203
pre.blankVotesPercentage -0.106425 0.037737
pre.nullVotes 0.145249 0.999499
pre.nullVotesPercentage -0.499374 -0.131705
pre.votersPercentage 1.000000 0.150632
pre.subscribedVoters 0.150632 1.000000
pre.totalVoters 0.165287 0.999725
Party 0.012210 -0.003254
Mandates 0.050533 0.327703
Percentage -0.012054 -0.011279
validVotesPercentage -0.013078 -0.011540
Votes 0.067619 0.404765
Hondt 0.055092 0.323704
FinalMandates 0.050450 0.296428
pre.totalVoters Party Mandates Percentage \
TimeElapsed 0.196667 -0.001169 0.090190 0.000767
time 0.196667 -0.001169 0.090190 0.000767
territoryName 0.195102 0.012263 0.057823 -0.003433
totalMandates 0.975404 -0.003111 0.336989 -0.010171
availableMandates 0.383089 -0.001882 0.072320 -0.011083
numParishes 0.853779 -0.004013 0.257365 -0.010770
numParishesApproved 0.943147 -0.003765 0.293795 -0.009840
blankVotes 0.998178 -0.003826 0.326630 -0.011883
blankVotesPercentage -0.000069 -0.019692 -0.000926 -0.006258
nullVotes 0.998706 -0.003172 0.325711 -0.011602
nullVotesPercentage -0.159566 0.007069 -0.056254 0.003301
votersPercentage 0.186834 0.014657 0.056553 -0.019836
subscribedVoters 0.999841 -0.003240 0.328197 -0.011276
totalVoters 0.999952 -0.003079 0.328836 -0.011560
pre.blankVotes 0.999085 -0.003586 0.327807 -0.011719
pre.blankVotesPercentage 0.035404 -0.016434 0.011932 -0.008944
pre.nullVotes 0.998879 -0.002885 0.326680 -0.011332
pre.nullVotesPercentage -0.139431 0.009577 -0.044964 0.000391
pre.votersPercentage 0.165287 0.012210 0.050533 -0.012054
pre.subscribedVoters 0.999725 -0.003254 0.327703 -0.011279
pre.totalVoters 1.000000 -0.003078 0.328527 -0.011453
Party -0.003078 1.000000 0.092855 0.232081
Mandates 0.328527 0.092855 1.000000 0.312402
Percentage -0.011453 0.232081 0.312402 1.000000
validVotesPercentage -0.011737 0.232066 0.311762 0.999944
Votes 0.404976 0.077322 0.972310 0.327706
Hondt 0.322329 0.104713 0.791960 0.401192
FinalMandates 0.295169 0.105097 0.814370 0.383086
validVotesPercentage Votes Hondt \
TimeElapsed 0.000526 0.082417 -0.001100
time 0.000526 0.082417 -0.001100
territoryName -0.004874 0.078520 0.084131
totalMandates -0.010474 0.395617 0.288420
availableMandates -0.011052 0.152348 0.291310
numParishes -0.010624 0.343593 0.359283
numParishesApproved -0.009767 0.380311 0.340901
blankVotes -0.012002 0.404003 0.322780
blankVotesPercentage -0.002076 -0.001486 -0.001305
nullVotes -0.011820 0.404195 0.326144
nullVotesPercentage 0.005441 -0.066526 -0.032388
votersPercentage -0.020966 0.076377 0.066665
subscribedVoters -0.011548 0.404861 0.322750
totalVoters -0.011856 0.404996 0.321587
pre.blankVotes -0.011872 0.404458 0.321997
pre.blankVotesPercentage -0.004922 0.013206 0.006242
pre.nullVotes -0.011607 0.404354 0.324854
pre.nullVotesPercentage 0.000158 -0.057086 -0.042245
pre.votersPercentage -0.013078 0.067619 0.055092
pre.subscribedVoters -0.011540 0.404765 0.323704
pre.totalVoters -0.011737 0.404976 0.322329
Party 0.232066 0.077322 0.104713
Mandates 0.311762 0.972310 0.791960
Percentage 0.999944 0.327706 0.401192
validVotesPercentage 1.000000 0.327054 0.400857
Votes 0.327054 1.000000 0.872658
Hondt 0.400857 0.872658 1.000000
FinalMandates 0.382744 0.882502 0.994480
FinalMandates
TimeElapsed -0.001007
time -0.001007
territoryName 0.077042
totalMandates 0.264117
availableMandates 0.266764
numParishes 0.329009
numParishesApproved 0.312177
blankVotes 0.295583
blankVotesPercentage -0.001195
nullVotes 0.298663
nullVotesPercentage -0.029659
votersPercentage 0.061048
subscribedVoters 0.295555
totalVoters 0.294490
pre.blankVotes 0.294866
pre.blankVotesPercentage 0.005716
pre.nullVotes 0.297482
pre.nullVotesPercentage -0.038686
pre.votersPercentage 0.050450
pre.subscribedVoters 0.296428
pre.totalVoters 0.295169
Party 0.105097
Mandates 0.814370
Percentage 0.383086
validVotesPercentage 0.382744
Votes 0.882502
Hondt 0.994480
FinalMandates 1.000000
[28 rows x 28 columns]
# Let's check how the features correlate with the target variable.
plt.figure(figsize=(16,6))
data.drop('FinalMandates',axis=1).corrwith(data['FinalMandates']).plot(kind='bar',grid=True)
plt.title('Correlation with target variable')
plt.show()
The correlation matrix presents the relationships between different electoral variables in the dataset. Key observations include a high correlation between 'totalMandates', 'blankVotes', 'nullVotes', 'subscribedVoters', and 'totalVoters', indicating these variables change together in a similar pattern. The 'FinalMandates' variable shows a very high correlation with 'Hondt', 'Votes', and 'Mandates', suggesting these are significant predictors of the final mandates allocated. Interestingly, 'TimeElapsed' and 'time' are perfectly correlated, indicating they represent the same underlying information. Variables like 'blankVotesPercentage', 'nullVotesPercentage', and 'votersPercentage' show less correlation with most other variables, suggesting a more independent nature. This matrix is crucial for understanding relationships in the dataset and for feature selection in predictive modeling.
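Since 'TimeElapsed' and 'time' carry identical information, one column of each perfectly correlated pair can be dropped before modeling. A minimal sketch of detecting such duplicates on a toy frame (the numbers are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with a duplicated signal, mimicking the 'TimeElapsed'/'time' pair
df = pd.DataFrame({
    'TimeElapsed': [0, 5, 10, 15],
    'time':        [0, 5, 10, 15],   # exact copy of TimeElapsed
    'Votes':       [100, 80, 120, 90],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flag columns that duplicate an earlier column (correlation ~ 1)
redundant = [c for c in upper.columns if (upper[c] > 0.9999).any()]
df_reduced = df.drop(columns=redundant)
print(redundant)
```

The same pattern generalizes to any correlation threshold, which is how the multicollinearity pruning later in this notebook works.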
plt.subplots(figsize=(16,4))
sns.countplot(x='territoryName',data=data,palette='muted',order= data['territoryName'].value_counts().index)
plt.title('Count of territoryName')
plt.xlabel('territoryName')
plt.ylabel('count of territoryName out of 21643')
plt.xticks(rotation=90)
plt.show()
print(data['territoryName'].value_counts())
16    1134
3     1134
13    1134
9     1134
11    1080
6     1080
0     1080
10    1080
19    1080
17    1080
15    1080
14    1026
20    1026
7     1026
8     1026
4      972
5      972
2      918
12     918
18     864
1      799
Name: territoryName, dtype: int64
The dataset includes various territories, each represented by a numeric code (0 to 20). Most territories appear in the dataset with similar frequencies. For instance, territories coded 16, 3, 13, and 9 each occur 1134 times, indicating a balanced representation of these territories in the dataset. Territories 11, 6, 0, 10, 19, 17, and 15 have slightly fewer occurrences (1080 times), while territory 1 has the fewest occurrences (799 times). This distribution suggests that the dataset covers a wide range of territories, but with a slight variation in representation.
plt.subplots(figsize=(16,4))
sns.countplot(x='Party',data=data,palette='muted',order= data['Party'].value_counts().index)
plt.title('Count of Party')
plt.xlabel('Party Name')
plt.ylabel('count of Party Name out of 21643')
plt.xticks(rotation=90)
plt.show()
print(data['Party'].value_counts())
17    1127
12    1127
14    1127
13    1127
4     1127
15    1127
0     1127
6     1127
3     1127
10    1127
11    1127
2     1127
1     1127
16    1073
20    1026
8     1019
18    1019
9      972
19     972
5      486
7      425
Name: Party, dtype: int64
Similar to 'territoryName', the 'Party' variable is represented by numeric codes (0 to 20). The distribution is quite uniform for most parties: codes 17, 12, 14, 13, 4, 15, 0, 6, 3, 10, 11, 2, and 1 each appear 1127 times. Some parties, such as those coded 16, 20, 8, and 18, occur somewhat less often (between 1019 and 1073 times). The parties coded 5 and 7 have significantly fewer occurrences, 486 and 425 respectively, indicating they are less frequently represented in the dataset.
# Let's aggregate the data per territory, ignoring 'time' and 'TimeElapsed'.
import warnings
warnings.filterwarnings('ignore')
territory_data = pd.DataFrame(data.groupby('territoryName')[['numParishes', 'numParishesApproved', 'blankVotes',
'blankVotesPercentage', 'nullVotes', 'nullVotesPercentage',
'votersPercentage', 'subscribedVoters', 'totalVoters', 'pre.blankVotes',
'pre.blankVotesPercentage', 'pre.nullVotes', 'pre.nullVotesPercentage',
'pre.votersPercentage', 'pre.subscribedVoters', 'pre.totalVoters',
'Party', 'Mandates', 'Percentage', 'validVotesPercentage', 'Votes',
'Hondt', 'FinalMandates']].sum())
territory_data.reset_index(inplace=True)
territory_data.head()
|  | territoryName | numParishes | numParishesApproved | blankVotes | blankVotesPercentage | nullVotes | nullVotesPercentage | votersPercentage | subscribedVoters | totalVoters | ... | pre.votersPercentage | pre.subscribedVoters | pre.totalVoters | Party | Mandates | Percentage | validVotesPercentage | Votes | Hondt | FinalMandates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158760 | 116080 | 7582900 | 3264.20 | 4555100 | 1930.60 | 58499.00 | 459790060 | 250086860 | ... | 60365.60 | 466483320 | 262194300 | 10962 | 429 | 5140.76 | 5400.58 | 11897443 | 864 | 864 |
| 1 | 1 | 124644 | 110024 | 2659684 | 3879.23 | 901136 | 1273.64 | 29252.75 | 152630301 | 55697865 | ... | 33241.46 | 151747695 | 62715788 | 7379 | 156 | 4396.65 | 4700.18 | 3066885 | 235 | 235 |
| 2 | 2 | 68850 | 61897 | 1077086 | 1895.50 | 868377 | 1580.32 | 47951.56 | 97392728 | 50867434 | ... | 53079.95 | 102188326 | 59042513 | 8586 | 123 | 5194.74 | 5398.81 | 2877763 | 162 | 162 |
| 3 | 3 | 393498 | 324849 | 12418434 | 3488.52 | 7063182 | 2255.61 | 67335.45 | 669599385 | 400078875 | ... | 67412.73 | 680355249 | 408316587 | 11340 | 646 | 5126.71 | 5399.80 | 18123679 | 1026 | 1026 |
| 4 | 4 | 219672 | 197118 | 1005444 | 1970.82 | 1109502 | 2199.60 | 43215.12 | 110203254 | 49239090 | ... | 45486.72 | 115655616 | 54370044 | 10260 | 87 | 5168.57 | 5400.40 | 2618008 | 162 | 162 |
5 rows × 24 columns
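One caveat of aggregating everything with sum(): percentage columns such as 'votersPercentage' get summed across time snapshots, producing totals (e.g. 58499.00) that no longer read as percentages. If interpretable per-territory figures are wanted, counts can be summed while percentages are averaged. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical snapshots for two territories; column names mirror the dataset
df = pd.DataFrame({
    'territoryName':    ['A', 'A', 'B', 'B'],
    'Votes':            [100, 150, 200, 250],
    'votersPercentage': [50.0, 54.0, 48.0, 52.0],
})

# Sum the count column but average the percentage column
agg = df.groupby('territoryName').agg(
    Votes=('Votes', 'sum'),
    votersPercentage=('votersPercentage', 'mean'),
).reset_index()
print(agg)
```

For the ranking plots below the summed values are still usable, since only the relative ordering of territories matters.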
import plotly.express as px
# Plotting the total vote count in each territory.
fig = px.bar(territory_data.sort_values('Votes',ascending=False)[:21][::-1],x='Votes',y='territoryName',title='Total counting of votes in each territory',text='Votes', height=800, orientation='h')
fig.show()
# Plotting the final number of elected MPs in each territory.
fig = px.bar(territory_data.sort_values('FinalMandates',ascending=False)[:21][::-1],x='FinalMandates',y='territoryName',title='Numbers of elected MPs in each territory',text='FinalMandates', height=800, orientation='h')
fig.show()
import plotly
import plotly.graph_objects as go
fig = go.Figure(data=[
go.Bar(name='Total number of parishes in this location', x=territory_data['territoryName'], y=territory_data['numParishes']),
go.Bar(name='Number of parishes approved in this location', x=territory_data['territoryName'], y=territory_data['numParishesApproved']),
])
# Change the bar mode
fig.update_layout(barmode='stack')
#fig.show()
plotly.offline.iplot(fig)
fig = go.Figure(data=[
go.Bar(name='Number of blank votes', x=territory_data['territoryName'], y=territory_data['blankVotes']),
go.Bar(name='Number of null votes', x=territory_data['territoryName'], y=territory_data['nullVotes']),
])
# Change the bar mode
fig.update_layout(barmode='group')
#fig.show()
plotly.offline.iplot(fig)
fig = go.Figure(data=[
go.Bar(name='Number of subscribed voters in the location', x=territory_data['territoryName'], y=territory_data['subscribedVoters']),
go.Bar(name='Total number of voters in the location', x=territory_data['territoryName'], y=territory_data['totalVoters']),
])
# Change the bar mode
fig.update_layout(barmode='stack')
#fig.show()
plotly.offline.iplot(fig)
fig = go.Figure(data=[
go.Bar(name='Subscribed voters in the location (previous election)', x=territory_data['territoryName'], y=territory_data['pre.subscribedVoters']),
go.Bar(name='Total voters in the location (previous election)', x=territory_data['territoryName'], y=territory_data['pre.totalVoters']),
go.Bar(name='Number of blank votes (previous election)', x=territory_data['territoryName'], y=territory_data['pre.blankVotes']),
go.Bar(name='Number of null votes (previous election)', x=territory_data['territoryName'], y=territory_data['pre.nullVotes']),
])
# Change the bar mode
fig.update_layout(barmode='relative')
#fig.show()
plotly.offline.iplot(fig)
# Let's aggregate the data per party, ignoring 'time' and 'TimeElapsed'.
import warnings
warnings.filterwarnings('ignore')
party_data = pd.DataFrame(data.groupby('Party')[[ 'blankVotes',
'blankVotesPercentage', 'nullVotes', 'nullVotesPercentage',
'Percentage', 'validVotesPercentage', 'Votes',
'FinalMandates']].sum())
party_data.reset_index(inplace=True)
party_data.head()
|  | Party | blankVotes | blankVotesPercentage | nullVotes | nullVotesPercentage | Percentage | validVotesPercentage | Votes | FinalMandates |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 9395287 | 2950.69 | 6518912 | 2215.39 | 719.16 | 754.24 | 2580816 | 0 |
| 1 | 1 | 9395287 | 2950.69 | 6518912 | 2215.39 | 9611.44 | 10076.67 | 33526986 | 2052 |
| 2 | 2 | 9395287 | 2950.69 | 6518912 | 2215.39 | 5023.66 | 5268.54 | 15564810 | 540 |
| 3 | 3 | 9395287 | 2950.69 | 6518912 | 2215.39 | 1465.35 | 1535.24 | 4499005 | 108 |
| 4 | 4 | 9395287 | 2950.69 | 6518912 | 2215.39 | 810.38 | 849.47 | 3925509 | 108 |
# Plotting total vote counts for the different parties.
fig = px.bar(party_data.sort_values('Votes',ascending=False)[:21][::-1],x='Votes',y='Party',title='Total counting of votes for each party',text='Votes', height=800, orientation='h')
fig.show()
# Plotting the final number of elected MPs from the different parties.
fig = px.bar(party_data.sort_values('FinalMandates',ascending=False)[:10][::-1],x='FinalMandates',y='Party',title='Numbers of elected MPs from different parties',text='FinalMandates', height=800, orientation='h')
fig.show()
import pandas as pd
import numpy as np
# Load the dataset from the file again
file_path = 'portugal.csv'
data = pd.read_csv(file_path)
# Excluding the target variable 'FinalMandates' from the multicollinearity check
independent_data = data.drop(columns=['FinalMandates'])
# Selecting only numeric columns from the independent variables
numeric_independent_data = independent_data.select_dtypes(include=[np.number])
# Recalculating the correlation matrix for the independent variables
corr_matrix_independent = numeric_independent_data.corr()
# Identifying pairs of highly correlated features among independent variables
threshold = 0.8
high_corr_pairs_independent = []
for i in range(len(corr_matrix_independent.columns)):
    for j in range(i):
        if abs(corr_matrix_independent.iloc[i, j]) > threshold:
            colname = corr_matrix_independent.columns[i]
            high_corr_pairs_independent.append((corr_matrix_independent.columns[j], colname))
# Determining which columns to remove among independent variables
columns_to_remove_independent = set()
for pair in high_corr_pairs_independent:
    # Remove the second column in each pair (an arbitrary choice)
    columns_to_remove_independent.add(pair[1])
# Updating the dataset by removing these columns
data_updated = data.drop(columns=columns_to_remove_independent)
# Displaying the columns removed and the shape of the updated dataset
print(columns_to_remove_independent)
print(data_updated.shape)
{'pre.votersPercentage', 'Votes', 'Hondt', 'pre.subscribedVoters', 'blankVotes', 'pre.blankVotes', 'pre.nullVotes', 'subscribedVoters', 'nullVotes', 'numParishesApproved', 'pre.totalVoters', 'validVotesPercentage', 'totalVoters', 'pre.blankVotesPercentage'}
(21643, 14)
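A complementary way to gauge multicollinearity is the variance inflation factor (VIF): for standardized predictors, the VIFs are the diagonal of the inverse correlation matrix. A sketch on synthetic data (not the election dataset), avoiding any extra dependencies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                          # independent noise

X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# For standardized predictors, VIF_i = [inverse correlation matrix]_ii
vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(vif.round(2))  # x1 and x2 are heavily inflated; x3 stays near 1
```

A common rule of thumb treats VIF above 5-10 as problematic, which lines up with the pairwise-correlation threshold of 0.8 used above.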
# Identifying categorical and numerical columns in data_updated
categorical_cols = data_updated.select_dtypes(include=['object']).columns
numerical_cols = data_updated.select_dtypes(include=['int64', 'float64']).columns
# Encoding categorical variables in data_updated
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    data_updated[col] = le.fit_transform(data_updated[col])
    label_encoders[col] = le  # Storing the label encoder for each column
# Ensuring numerical columns in data_updated are in the correct format
for col in numerical_cols:
    data_updated[col] = pd.to_numeric(data_updated[col], errors='coerce')
# Drawing a boxplot for each numerical column in data_updated
for i in data_updated.columns:
    # Check if the column is numerical
    if data_updated[i].dtype in ['int64', 'float64']:
        sns.boxplot(x=data_updated[i])  # plot from data_updated, not the raw 'data'
        plt.title(f'Boxplot for {i}')
        plt.show()
    else:
        print(f"Skipping boxplot for {i} as it is not numerical data.")
Boxplot Visualization:
The code iterates through each column in the DataFrame data. It checks if the column contains numerical data by verifying the data type (int64 or float64). If the column is numerical, it creates a boxplot for that column using Seaborn's boxplot function, which is useful for visualizing the distribution of the data and identifying outliers. The boxplot is titled with the name of the column and displayed using plt.show(). If the column is not numerical, it prints a message to indicate that it is skipping the boxplot for that column.
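A corrected version of the loop checks the dtype on the same frame it plots from. A minimal sketch, using a small hypothetical stand-in frame (`demo`, in place of `data_updated`) and a non-interactive backend so it runs headless:

```python
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# `demo` is a small hypothetical stand-in for data_updated
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Mandates": rng.poisson(3, 50),
    "Percentage": rng.random(50) * 100,
    "Party": rng.choice(["A", "B", "C"], 50),
})

plotted = []
for col in demo.columns:
    # Check the dtype on the SAME frame that gets plotted; mixing data[i]
    # with data_updated[i] is what raised the TypeError above.
    if demo[col].dtype in ("int64", "float64"):
        sns.boxplot(x=demo[col])
        plt.title(f"Boxplot for {col}")
        plt.close()  # close() instead of show() in a headless sketch
        plotted.append(col)
    else:
        print(f"Skipping boxplot for {col} as it is not numerical data.")
```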
# Let's remove the outliers using the z-score
from scipy.stats import zscore
z = abs(zscore(data_updated))
print(data_updated.shape)
new = data_updated.loc[(z < 3).all(axis=1)]
print(new.shape)
# Shapes before and after the filtering are shown below.
(21643, 14) (18333, 14)
new.shape
(18333, 14)
The output (21643, 14) and (18333, 14) before and after the z-score filtering indicates that the dataset had 21,643 rows and 14 columns, and that 18,333 rows remain after removing outliers based on the z-score criterion. The filter therefore dropped 3,310 rows, each containing at least one value more than three standard deviations from its column mean.
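The |z| &lt; 3 rule itself can be sketched on toy data (synthetic numbers, not the election set); a single gross outlier gets a z-score above 3 and is dropped:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy illustration of the |z| < 3 rule (synthetic numbers, not the election set)
rng = np.random.default_rng(0)
df = pd.DataFrame({"votes": np.append(rng.normal(10, 1, 50), 500.0)})

z = np.abs(zscore(df))              # column-wise z-scores
kept = df.loc[(z < 3).all(axis=1)]  # keep rows with every |z| below 3
print(df.shape, kept.shape)         # the injected 500.0 row is dropped
```

Note that on very small samples a single outlier can inflate the standard deviation enough to keep its own z-score under 3, which is a known caveat of this trimming rule.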
# Feature Selection Based on Correlation
correlation_with_target = new.corrwith(new['FinalMandates']).sort_values(ascending=False)
threshold = 0.1
selected_features_corr = correlation_with_target[abs(correlation_with_target) > threshold].index.tolist()
selected_features_corr.remove('FinalMandates')
# Assigning selected features to a new variable for modeling
selected_features_for_modeling = new[selected_features_corr]
# Displaying selected features
print("Selected Features for Modeling:")
print(selected_features_for_modeling.columns)
Selected Features for Modeling:
Index(['Mandates', 'Percentage', 'availableMandates', 'votersPercentage',
'totalMandates'],
dtype='object')
The code snippet performs feature selection in a regression context, targeting the 'FinalMandates' variable. It computes the correlation of each feature with 'FinalMandates' and keeps those with an absolute correlation above the 0.1 threshold, removing 'FinalMandates' itself so the target is not used as a feature. The five features selected — 'Mandates', 'Percentage', 'availableMandates', 'votersPercentage', and 'totalMandates' — are all mandate- or vote-share-related attributes, which makes them plausible linear predictors of the final number of mandates allocated.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Use the selected features for modeling (excluding the target variable)
X = selected_features_for_modeling
# Standardize the features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Extract explained variance ratio and cumulative explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = explained_variance_ratio.cumsum()
# Determine the number of components to explain a certain threshold of variance
threshold_variance = 0.95 # You can adjust this threshold
n_components = (cumulative_explained_variance < threshold_variance).sum() + 1
# Fit PCA with the selected number of components
pca_final = PCA(n_components=n_components)
X_pca_final = pca_final.fit_transform(X_scaled)
# For each retained component, take the original feature with the largest loading
PCA_features = X.columns[pca_final.components_.argmax(axis=1)]
# Display selected features based on PCA
print("Selected Features based on PCA:")
for feature in PCA_features:
print(feature)
# Compare selected features with correlation-based method
print("\nComparison with Correlation-based Method:")
common_features = set(PCA_features) & set(selected_features_corr)
print("Common Features:", common_features)
Selected Features based on PCA:
totalMandates
Mandates
availableMandates
Mandates
votersPercentage
Comparison with Correlation-based Method:
Common Features: {'totalMandates', 'votersPercentage', 'Mandates', 'availableMandates'}
Comparison with Correlation-based Method: the features common to PCA and the correlation-based method are 'totalMandates', 'votersPercentage', 'Mandates', and 'availableMandates'.
Interpretation of Common Features: these features are flagged as important by both approaches. The correlation method evaluates the linear relationship between each feature and the target variable ('FinalMandates'), while PCA identifies features that contribute most to the overall variance in the dataset.
Reasons to Prefer the Correlation Method over PCA:
Direct Interpretation: the correlation method directly measures the linear relationship between each feature and the target, which aligns with the goal of selecting features that strongly affect the outcome.
Domain-specific Knowledge: features known from the domain to be influential can be kept deliberately, since selection operates on the original columns rather than on transformed components.
Simplicity: the method involves no transformation and is easy to communicate to stakeholders.
Targeted Approach: PCA is unsupervised and ignores the target entirely; correlation ties the selection directly to the outcome variable.
from sklearn.ensemble import RandomForestRegressor
# Assuming 'y' is your target variable
y = new['FinalMandates']
# Using the selected features for modeling (from the correlation-based method)
X_selected = selected_features_for_modeling
# Fit a Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_selected, y)
# Extract feature importances
feature_importances = rf_model.feature_importances_
# Get indices of top features (adjust the number as needed, here I use the length of X_selected columns)
top_features_indices = feature_importances.argsort()[-len(X_selected.columns):][::-1]
# Extract feature names corresponding to the top indices
RF_features = X_selected.columns[top_features_indices]
# Display selected features based on Random Forest
print("Selected Features based on Random Forest:")
for feature in RF_features:
print(feature)
# Compare selected features with correlation-based method
print("\nComparison with Correlation-based Method:")
common_features_RF_corr = set(RF_features) & set(selected_features_corr)
print("Common Features:", common_features_RF_corr)
Selected Features based on Random Forest:
Percentage
availableMandates
votersPercentage
Mandates
totalMandates
Comparison with Correlation-based Method:
Common Features: {'availableMandates', 'Mandates', 'totalMandates', 'Percentage', 'votersPercentage'}
These common features are considered important by both Random Forest and the correlation method. The correlation method evaluates the linear relationship between each feature and the target variable ('FinalMandates'), while Random Forest identifies features that contribute most to the model's accuracy.
Direct Interpretation: the correlation method directly measures the linear relationship between each feature and the target variable, giving a clear reading of each feature's impact.
Simplicity: it is straightforward to understand and involves no model training or importance calculation.
Domain-specific Knowledge: features known to be influential in the domain can be incorporated directly into the selection.
Robustness: rank-based (Spearman) correlation in particular is largely insensitive to outliers, whereas impurity-based Random Forest importances can vary from run to run and are biased toward high-cardinality features.
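The robustness point is easiest to see with rank correlation. A toy sketch (hypothetical data, not the election set) contrasting Pearson and Spearman under a single gross outlier:

```python
import numpy as np
import pandas as pd

# One corrupted reading wrecks the Pearson coefficient but barely moves
# the rank-based Spearman coefficient (synthetic data).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.1, size=200)
x[0] = 50.0  # inject a single gross outlier

df = pd.DataFrame({"x": x, "y": y})
pearson = df["x"].corr(df["y"])
spearman = df["x"].corr(df["y"], method="spearman")
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")
```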
from sklearn.model_selection import train_test_split
# Using the selected features for modeling (from the correlation-based method)
X_selected = selected_features_for_modeling
# Assuming 'y' is your target variable
y = new['FinalMandates']
# Splitting the data into 80% training and 20% testing sets
# stratify=y is valid here because FinalMandates takes a small set of integer values
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42, stratify=y)
# Displaying the shapes of the splits
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Testing set shape (X_test, y_test):", X_test.shape, y_test.shape)
Training set shape (X_train, y_train): (14666, 5) (14666,) Testing set shape (X_test, y_test): (3667, 5) (3667,)
The output shows the dataset divided into training and testing sets. The training set, used to fit the models, contains 14,666 rows with the 5 selected features (X_train) and the corresponding target values (y_train). The testing set, used to evaluate performance on unseen data, contains 3,667 rows with the same 5 features (X_test) and their associated target values (y_test). The split preserves the feature dimensionality across both sets while partitioning the rows, so training and testing data share the same structure.
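The effect of `stratify=y` can be checked on a toy discrete target (hypothetical counts standing in for `FinalMandates`): the class proportions come out the same in both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy discrete target mimicking mandate counts (synthetic data)
y_toy = pd.Series(np.repeat([0, 1, 2], [600, 300, 100]), name="FinalMandates")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# stratify keeps the target distribution identical across the two splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
print("train:", y_tr.value_counts(normalize=True).round(2).to_dict())
print("test: ", y_te.value_counts(normalize=True).round(2).to_dict())
```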
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
# Train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Predictions and accuracy
lr_predictions = lr_model.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_predictions)
print(f"Linear Regression MSE: {lr_mse}")
lr_rmse = sqrt(mean_squared_error(y_test, lr_predictions)) # Root Mean Squared Error
lr_r2 = r2_score(y_test, lr_predictions) # R-squared
print(f"Linear Regression - RMSE: {lr_rmse}, R-squared: {lr_r2}")
Linear Regression MSE: 0.6470437534649847 Linear Regression - RMSE: 0.8043902992111384, R-squared: 0.6821865230201227
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Predictions
lr_train_pred = lr_model.predict(X_train)
# Metrics
lr_mse = mean_squared_error(y_train, lr_train_pred)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_train, lr_train_pred)
print("Linear Regression - MSE:", lr_mse)
print("Linear Regression - RMSE:", lr_rmse)
print("Linear Regression - R^2:", lr_r2)
Linear Regression - MSE: 0.6370463422687005 Linear Regression - RMSE: 0.7981518290831016 Linear Regression - R^2: 0.6881306423050773
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Train the Decision Tree model
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, y_train)
# Predictions and accuracy
dt_predictions = dt_model.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_predictions)
dt_rmse = sqrt(mean_squared_error(y_test, dt_predictions))
dt_r2 = r2_score(y_test, dt_predictions)
print(f"Decision Tree MSE: {dt_mse}")
print(f"Decision Tree - RMSE: {dt_rmse}, R-squared: {dt_r2}")
Decision Tree MSE: 0.0031928915553131533 Decision Tree - RMSE: 0.05650567719542129, R-squared: 0.998431722798683
from sklearn.tree import DecisionTreeRegressor
# Decision Tree Model
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, y_train)
# Predictions
dt_train_pred = dt_model.predict(X_train)
# Metrics
dt_mse = mean_squared_error(y_train, dt_train_pred)
dt_rmse = np.sqrt(dt_mse)
dt_r2 = r2_score(y_train, dt_train_pred)
print("Decision Tree - MSE:", dt_mse)
print("Decision Tree - RMSE:", dt_rmse)
print("Decision Tree - R^2:", dt_r2)
Decision Tree - MSE: 0.0001988726760307287 Decision Tree - RMSE: 0.014102222379140413 Decision Tree - R^2: 0.9999026408447526
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
# Train the KNN model
knn_model = KNeighborsRegressor()
knn_model.fit(X_train, y_train)
# Predictions and accuracy
knn_predictions = knn_model.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_predictions)
# Calculate RMSE and R-squared
knn_rmse = sqrt(mean_squared_error(y_test, knn_predictions))
knn_r2 = r2_score(y_test, knn_predictions)
print(f"KNN MSE: {knn_mse}")
print(f"KNN - RMSE: {knn_rmse}, R-squared: {knn_r2}")
KNN MSE: 0.0075593127897463875 KNN - RMSE: 0.08694430855292593, R-squared: 0.9962870339626612
from sklearn.neighbors import KNeighborsRegressor
# Training the model
knn_model = KNeighborsRegressor()
knn_model.fit(X_train, y_train)
# Predictions and metrics
knn_predictions = knn_model.predict(X_train)
knn_mse = mean_squared_error(y_train, knn_predictions)
knn_rmse = np.sqrt(knn_mse)
knn_r2 = r2_score(y_train, knn_predictions)
print("KNN - MSE:", knn_mse, "RMSE:", knn_rmse, "R-squared:", knn_r2)
KNN - MSE: 0.004543842901950089 RMSE: 0.06740803291856312 R-squared: 0.9977755380209057
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Train the Gradient Boosting model
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)
# Predictions and accuracy
gb_predictions = gb_model.predict(X_test)
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_rmse = sqrt(mean_squared_error(y_test, gb_predictions))
gb_r2 = r2_score(y_test, gb_predictions)
print(f"Gradient Boosting MSE: {gb_mse}")
print(f"Gradient Boosting - RMSE: {gb_rmse}, R-squared: {gb_r2}")
Gradient Boosting MSE: 0.02363320004787487 Gradient Boosting - RMSE: 0.1537309339328779, R-squared: 0.988391898632582
from sklearn.ensemble import GradientBoostingRegressor
# Training the model
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)
# Predictions and metrics
gb_predictions = gb_model.predict(X_train)
gb_mse = mean_squared_error(y_train, gb_predictions)
gb_rmse = np.sqrt(gb_mse)
gb_r2 = r2_score(y_train, gb_predictions)
print("Gradient Boosting - MSE:", gb_mse, "RMSE:", gb_rmse, "R-squared:", gb_r2)
Gradient Boosting - MSE: 0.018340579756960716 RMSE: 0.13542739662623923 R-squared: 0.9910212735729934
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
# Train the SVM model
svm_model = SVR()
svm_model.fit(X_train, y_train)
# Predictions and accuracy
svm_predictions = svm_model.predict(X_test)
svm_mse = mean_squared_error(y_test, svm_predictions)
print(f"SVM MSE: {svm_mse}")
svm_rmse = sqrt(mean_squared_error(y_test, svm_predictions))
svm_r2 = r2_score(y_test, svm_predictions)
print(f"SVM - RMSE: {svm_rmse}, R-squared: {svm_r2}")
SVM MSE: 0.04452952806250643 SVM - RMSE: 0.2110202077112674, R-squared: 0.9781280878363598
from sklearn.svm import SVR
# Training the model
svm_model = SVR()
svm_model.fit(X_train, y_train)
# Predictions and metrics
svm_predictions = svm_model.predict(X_train)
svm_mse = mean_squared_error(y_train, svm_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_r2 = r2_score(y_train, svm_predictions)
print("SVM - MSE:", svm_mse, "RMSE:", svm_rmse, "R-squared:", svm_r2)
SVM - MSE: 0.04344887632682361 RMSE: 0.2084439404895801 R-squared: 0.9787293761010291
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Train the Random Forest model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)
# Predictions and accuracy
rf_predictions = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_predictions)
print(f"Random Forest MSE: {rf_mse}")
rf_rmse = sqrt(mean_squared_error(y_test, rf_predictions))
rf_r2 = r2_score(y_test, rf_predictions)
print(f"Random Forest - RMSE: {rf_rmse}, R-squared: {rf_r2}")
Random Forest MSE: 0.003092066599378298 Random Forest - RMSE: 0.055606353947892484, R-squared: 0.9984812457708783
from sklearn.ensemble import RandomForestRegressor
# Training the model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)
# Predictions and metrics
rf_predictions = rf_model.predict(X_train)
rf_mse = mean_squared_error(y_train, rf_predictions)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_train, rf_predictions)
print("Random Forest - MSE:", rf_mse, "RMSE:", rf_rmse, "R-squared:", rf_r2)
Random Forest - MSE: 0.0007468093636886641 RMSE: 0.02732781300595904 R-squared: 0.999634395582989
Testing Dataset: MSE: 0.647, RMSE: 0.804, R-squared: 0.682 Training Dataset: MSE: 0.637, RMSE: 0.798, R-squared: 0.688 Interpretation: The Linear Regression model performs almost identically on the training and testing datasets, so it is neither significantly underfitting nor overfitting. It has generalised well but, with an R-squared of only about 0.68, is not flexible enough to capture the non-linear patterns in the data.
Testing Dataset: MSE: 0.003, RMSE: 0.057, R-squared: 0.998 Training Dataset: MSE: 0.0002, RMSE: 0.014, R-squared: 0.9999 Interpretation: The Decision Tree model shows an exceptionally good fit on the training dataset but a slightly less perfect fit on the testing dataset. This indicates a potential overfitting, where the model has learned the training data too well, including its noise and outliers, and may not generalize as effectively to new, unseen data.
Testing Dataset: MSE: 0.008, RMSE: 0.087, R-squared: 0.996 Training Dataset: MSE: 0.005, RMSE: 0.067, R-squared: 0.998 Interpretation: KNN shows very good performance on both datasets with only a slight difference in metrics. This suggests good generalization, with no significant signs of underfitting or overfitting.
Testing Dataset: MSE: 0.024, RMSE: 0.153, R-squared: 0.988 Training Dataset: MSE: 0.018, RMSE: 0.135, R-squared: 0.991 Interpretation: Gradient Boosting shows high performance on both datasets. The slightly better performance on the training dataset as compared to the testing dataset is normal for most models and does not necessarily indicate overfitting.
Testing Dataset: MSE: 0.045, RMSE: 0.211, R-squared: 0.978 Training Dataset: MSE: 0.043, RMSE: 0.208, R-squared: 0.979 Interpretation: The SVM model has almost identical performance on both datasets, indicating that it has generalized well without significant overfitting or underfitting.
Testing Dataset: MSE: 0.003, RMSE: 0.056, R-squared: 0.998 Training Dataset: MSE: 0.0007, RMSE: 0.027, R-squared: 0.9996 Interpretation: Random Forest shows excellent performance on both datasets but, like the Decision Tree, fits the training data more closely than the test data. This could suggest mild overfitting, though its performance on the testing dataset is still outstanding.
Most models are well-tuned, showing good generalization capabilities. Decision Tree and Random Forest, while showing excellent performance, exhibit signs of overfitting, as indicated by their near-perfect training scores. Linear Regression, KNN, Gradient Boosting, and SVM show balanced performance, indicating effective learning and generalization.
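One standard remedy for the tree models' overfitting is to constrain their capacity. A minimal sketch on synthetic data (the election frame is not reproduced here) showing how `max_depth` trades training fit for a smaller train/test gap:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic regression problem standing in for the election features
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

full = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)        # unconstrained
pruned = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_tr, y_tr)

# The unconstrained tree memorises the training set; the depth-limited tree
# gives up some training fit in exchange for a smaller train/test gap.
for name, model in [("full tree", full), ("max_depth=5", pruned)]:
    print(name,
          "train R2:", round(r2_score(y_tr, model.predict(X_tr)), 3),
          "test R2:", round(r2_score(y_te, model.predict(X_te)), 3))
```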
# Create a DataFrame to store results
results = pd.DataFrame({
'Model': ['Linear Regression', 'Decision Tree', 'KNN', 'Gradient Boosting', 'SVM', 'Random Forest'],
'MSE': [lr_mse, dt_mse, knn_mse, gb_mse, svm_mse, rf_mse],
'RMSE': [lr_rmse, dt_rmse, knn_rmse, gb_rmse, svm_rmse, rf_rmse],
'R-squared': [lr_r2, dt_r2, knn_r2, gb_r2, svm_r2, rf_r2]
})
# Display the results
print(results)
Model MSE RMSE R-squared 0 Linear Regression 0.637046 0.798152 0.688131 1 Decision Tree 0.000199 0.014102 0.999903 2 KNN 0.004544 0.067408 0.997776 3 Gradient Boosting 0.018341 0.135427 0.991021 4 SVM 0.043449 0.208444 0.978729 5 Random Forest 0.000747 0.027328 0.999634
The table summarises the six models on three metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. Note that these figures are the training-set metrics, since the metric variables were last assigned in the training-set evaluation cells. The Decision Tree (MSE 0.0002, R-squared 0.9999) and Random Forest (MSE 0.0007, R-squared 0.9996) fit the data almost perfectly, followed by KNN (R-squared 0.9978) and Gradient Boosting (0.9910). SVM trails slightly at 0.9787, while Linear Regression is the clear outlier with an MSE of 0.637 and an R-squared of 0.688, confirming that the relationship between the selected features and 'FinalMandates' is largely non-linear.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
# Train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Cross-validation predictions
lr_cv_predictions = cross_val_predict(lr_model, X, y, cv=5)
# Calculate CV MSE, RMSE, and R-squared
lr_CV_MSE = mean_squared_error(y, lr_cv_predictions)
lr_CV_RMSE = sqrt(lr_CV_MSE)
lr_CV_R2 = r2_score(y, lr_cv_predictions)
# Display results for Linear Regression
print("Linear Regression Results:")
print(f"CV MSE: {lr_CV_MSE}")
print(f"CV RMSE: {lr_CV_RMSE}")
print(f"CV R-squared: {lr_CV_R2}\n")
Linear Regression Results: CV MSE: 0.7582237832892113 CV RMSE: 0.8707604626355122 CV R-squared: 0.6285621761996896
from sklearn.tree import DecisionTreeRegressor
# Train the Decision Tree model
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, y_train)
# Cross-validation predictions
dt_cv_predictions = cross_val_predict(dt_model, X, y, cv=5)
# Calculate CV MSE, RMSE, and R-squared
dt_CV_MSE = mean_squared_error(y, dt_cv_predictions)
dt_CV_RMSE = sqrt(dt_CV_MSE)
dt_CV_R2 = r2_score(y, dt_cv_predictions)
# Display results for Decision Tree Model
print("Decision Tree Results:")
print(f"CV MSE: {dt_CV_MSE}")
print(f"CV RMSE: {dt_CV_RMSE}")
print(f"CV R-squared: {dt_CV_R2}\n")
Decision Tree Results: CV MSE: 0.09253804614629357 CV RMSE: 0.3042006675638526 CV R-squared: 0.9546675648576941
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
# Train the KNN model
knn_model = KNeighborsRegressor()
knn_model.fit(X_train, y_train)
# Cross-validation predictions
knn_cv_predictions = cross_val_predict(knn_model, X, y, cv=5)
# Calculate CV MSE, RMSE, and R-squared
knn_CV_MSE = mean_squared_error(y, knn_cv_predictions)
knn_CV_RMSE = sqrt(knn_CV_MSE)
knn_CV_R2 = r2_score(y, knn_cv_predictions)
# Display results for KNN (K-Nearest Neighbors)
print("KNN Results:")
print(f"CV MSE: {knn_CV_MSE}")
print(f"CV RMSE: {knn_CV_RMSE}")
print(f"CV R-squared: {knn_CV_R2}\n")
KNN Results: CV MSE: 0.06426880488736159 CV RMSE: 0.25351292844224255 CV R-squared: 0.9685160693297551
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
# Train the Gradient Boosting model
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)
# Cross-validation predictions
gb_cv_predictions = cross_val_predict(gb_model, X, y, cv=5)
# Calculate CV MSE, RMSE, and R-squared
gb_CV_MSE = mean_squared_error(y, gb_cv_predictions)
gb_CV_RMSE = sqrt(gb_CV_MSE)
gb_CV_R2 = r2_score(y, gb_cv_predictions)
# Display results for Gradient Boosting
print("Gradient Boosting Results:")
print(f"CV MSE: {gb_CV_MSE}")
print(f"CV RMSE: {gb_CV_RMSE}")
print(f"CV R-squared: {gb_CV_R2}\n")
Gradient Boosting Results: CV MSE: 0.12357330909607667 CV RMSE: 0.35152995476356874 CV R-squared: 0.9394640447555809
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
# Train the SVM model
svm_model = SVR()
svm_model.fit(X_train, y_train)
# Cross-validation predictions
svm_cv_predictions = cross_val_predict(svm_model, X, y, cv=5)
# Calculate CV MSE, RMSE, and R-squared
svm_CV_MSE = mean_squared_error(y, svm_cv_predictions)
svm_CV_RMSE = sqrt(svm_CV_MSE)
svm_CV_R2 = r2_score(y, svm_cv_predictions)
# Display results for SVM (Support Vector Machine)
print("SVM Results:")
print(f"CV MSE: {svm_CV_MSE}")
print(f"CV RMSE: {svm_CV_RMSE}")
print(f"CV R-squared: {svm_CV_R2}\n")
SVM Results: CV MSE: 0.04911343932599408 CV RMSE: 0.2216155214013542 CV R-squared: 0.9759403629579402
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
# Train the Random Forest model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)
# Cross-validation predictions
rf_cv_predictions = cross_val_predict(rf_model, X, y, cv=5)
# Calculate CV MSE, RMSE, and R-squared
rf_CV_MSE = mean_squared_error(y, rf_cv_predictions)
rf_CV_RMSE = sqrt(rf_CV_MSE)
rf_CV_R2 = r2_score(y, rf_cv_predictions)
# Display results for Random Forest
print("Random Forest Results:")
print(f"CV MSE: {rf_CV_MSE}")
print(f"CV RMSE: {rf_CV_RMSE}")
print(f"CV R-squared: {rf_CV_R2}\n")
Random Forest Results: CV MSE: 0.11158879642760851 CV RMSE: 0.3340490928405711 CV R-squared: 0.9453350044946335
from prettytable import PrettyTable
import matplotlib.pyplot as plt
# Data
data = [
["Linear Regression", lr_mse, lr_rmse, lr_r2, lr_CV_MSE, lr_CV_RMSE, lr_CV_R2],
["Decision Tree", dt_mse, dt_rmse, dt_r2, dt_CV_MSE, dt_CV_RMSE, dt_CV_R2],
["KNN", knn_mse, knn_rmse, knn_r2, knn_CV_MSE, knn_CV_RMSE, knn_CV_R2],
["Gradient Boosting", gb_mse, gb_rmse, gb_r2, gb_CV_MSE, gb_CV_RMSE, gb_CV_R2],
["SVM", svm_mse, svm_rmse, svm_r2, svm_CV_MSE, svm_CV_RMSE, svm_CV_R2],
["Random Forest", rf_mse, rf_rmse, rf_r2, rf_CV_MSE, rf_CV_RMSE, rf_CV_R2]
]
# Column headers
headers = ["Model", "MSE", "RMSE", "R-squared", "MSE (CV)", "RMSE (CV)", "R-squared (CV)"]
# Create a PrettyTable
table = PrettyTable()
table.field_names = headers
table.float_format = ".4"
for row in data:
table.add_row(row)
# Print the table
print(table)
+-------------------+--------+--------+-----------+----------+-----------+----------------+ | Model | MSE | RMSE | R-squared | MSE (CV) | RMSE (CV) | R-squared (CV) | +-------------------+--------+--------+-----------+----------+-----------+----------------+ | Linear Regression | 0.6370 | 0.7982 | 0.6881 | 0.7582 | 0.8708 | 0.6286 | | Decision Tree | 0.0002 | 0.0141 | 0.9999 | 0.0925 | 0.3042 | 0.9547 | | KNN | 0.0045 | 0.0674 | 0.9978 | 0.0643 | 0.2535 | 0.9685 | | Gradient Boosting | 0.0183 | 0.1354 | 0.9910 | 0.1236 | 0.3515 | 0.9395 | | SVM | 0.0434 | 0.2084 | 0.9787 | 0.0491 | 0.2216 | 0.9759 | | Random Forest | 0.0007 | 0.0273 | 0.9996 | 0.1116 | 0.3340 | 0.9453 | +-------------------+--------+--------+-----------+----------+-----------+----------------+
# Data preparation
models = ["Linear Regression", "Decision Tree", "KNN", "Gradient Boosting", "SVM", "Random Forest"]
mse_values = [lr_mse, dt_mse, knn_mse, gb_mse, svm_mse, rf_mse]
rmse_values = [lr_rmse, dt_rmse, knn_rmse, gb_rmse, svm_rmse, rf_rmse]
r2_values = [lr_r2, dt_r2, knn_r2, gb_r2, svm_r2, rf_r2]
# Creating subplots for each metric
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
# MSE Plot
axes[0].bar(models, mse_values, color='skyblue')
axes[0].set_title('MSE for Each Model')
axes[0].set_ylabel('MSE')
axes[0].tick_params(axis='x', rotation=45)
# RMSE Plot
axes[1].bar(models, rmse_values, color='lightgreen')
axes[1].set_title('RMSE for Each Model')
axes[1].set_ylabel('RMSE')
axes[1].tick_params(axis='x', rotation=45)
# R-squared Plot
axes[2].bar(models, r2_values, color='salmon')
axes[2].set_title('R-squared for Each Model')
axes[2].set_ylabel('R-squared')
axes[2].tick_params(axis='x', rotation=45)
# Adjust layout to prevent overlap
plt.tight_layout()
# Display the plots
plt.show()
Linear Regression: CV R-squared 0.629, the weakest of the six, consistent with its holdout results.
Decision Tree: CV R-squared 0.955, a sharp drop from its near-perfect holdout fit.
K-Nearest Neighbors (KNN): CV R-squared 0.969, a modest decline.
Gradient Boosting: CV R-squared 0.939, down from 0.991.
Support Vector Machine (SVM): CV R-squared 0.976, essentially unchanged and now the best of the six.
Random Forest: CV R-squared 0.945, down from 0.9996, mirroring the Decision Tree.
Cross-validation involves splitting the dataset into multiple folds and training/testing the model multiple times. This provides a more robust evaluation by reducing the dependence on one particular arrangement of the training and test sets. When results worsen after applying CV, it usually indicates overfitting: the model has learned the training data too well, including noise and fluctuations, and fails to carry that performance over to new, unseen data. Models like Decision Trees are particularly prone to this. Here, the tree-based models, KNN, and Gradient Boosting all drop noticeably under CV, while SVM is essentially unchanged, which suggests the former were benefiting from the specific split rather than genuinely generalising.
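The mechanics that `cross_val_predict` wraps can be written out by hand; a minimal 5-fold sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic linear data standing in for the selected election features
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=7).split(X):
    # Each fold trains on 4/5 of the rows and scores on the held-out 1/5
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
print("fold R2:", np.round(scores, 3), "mean:", round(float(np.mean(scores)), 3))
```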
This calls for actions like parameter tuning, feature selection, or using more robust or simpler models to improve the model's ability to generalize.
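Parameter tuning of the kind suggested here is typically done with `GridSearchCV`; a small sketch on synthetic data (the grid values are illustrative, not tuned for the election set):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the election features
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=5.0,
                                 random_state=42)

# Exhaustive search over a small illustrative grid, scored by CV MSE
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X_demo, y_demo)
print("best params:", grid.best_params_)
print("best CV MSE:", -grid.best_score_)
```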
It's important to note that the near-perfect holdout results of the Decision Tree and Random Forest models are themselves a warning sign of overfitting, which the cross-validation figures confirm. The SVM, by contrast, was the most stable model under cross-validation; its results could likely be improved further by tuning its hyperparameters and by feature scaling, which SVMs generally require to perform well and which was not applied to the model inputs here.
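The scaling fix for SVR is usually expressed as a pipeline, so each CV fold is scaled on its own training split and no information leaks; a sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: every feature matters equally, but one arrives on
# a wildly different scale and dominates the RBF kernel distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X.sum(axis=1)  # target defined before rescaling the first column
X[:, 0] *= 1000

plain = cross_val_score(SVR(), X, y, cv=5, scoring="r2").mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVR()),
                         X, y, cv=5, scoring="r2").mean()
print(f"SVR R2 without scaling: {plain:.3f}")
print(f"SVR R2 with scaling:    {scaled:.3f}")
```

The pipeline refits the scaler inside each fold, so the held-out fold never leaks into the scaling statistics.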
The comparative analysis of six machine learning models has yielded a comprehensive insight into their performance on a regression task involving electoral data. On the holdout split, the Decision Tree and Random Forest models emerged as top performers, boasting near-perfect R-squared values and minimal MSE and RMSE. Under cross-validation, however, the SVM proved the most stable model (CV R-squared 0.976), while Linear Regression lagged well behind the rest (CV R-squared 0.629), reflecting its inability to capture the non-linear structure of the data.
Recommendations
Based on the model evaluations, the following recommendations are proposed:
Future Work
The scope for future work entails several avenues for exploration to solidify the findings of this project:
By addressing these recommendations and future work directions, the project can serve as a stepping stone towards more nuanced models that offer both precision and generalizability in the field of electoral prediction.